Minimum Conflict Individual Haplotyping from SNP Fragments and Related Genotype

The Minimum Error Correction (MEC) is an important model for haplotype reconstruction from SNP fragments. However, this model is effective only when the error rate of SNP fragments is low. In this paper, we propose a new computational model called Minimum Conflict Individual Haplotyping (MCIH) as an extension to MEC. In contrast to the conventional approaches, the new model employs SNP fragment information and also related genotype information, thereby a high accurate inference can be expected. We first prove the MCIH problem to be NP-hard. To evaluate the practicality of the new model we design an exact algorithm (a dynamic programming procedure) to implement MCIH on a special data structure. The numerical experience indicates that it is fairly effective to use MCIH at the cost of related genotype information, especially in the case of SNP fragments with a high error rate. Moreover, we present a feed-forward neural network algorithm to solve MCIH for general data structure and large size instances. Numerical results on real biological data and simulation data show that the algorithm works well and MCIH is a potential alternative in individual haplotyping.


Introduction
The availability of complete genome sequence for human beings [Vent et al. 2001] makes it possible to investigate genetic differences and to associate genetic variations with complex diseases [Hoehe et al. 2000]. Single nucleotide polymor phism (SNP)-a single DNA base varying from one individual to another, is believed to be the most frequent form to address genetic differences [Chakravarti, 1998;Li et al. 2005]. SNPs are found approximately every 1000 base pairs in the human genome and turn to be promising tools for doing disease association study. Many research works have been carried out for determining SNP sites or designing a detailed SNP map for human genome [Altshuler et al. 2000;Helmuth, 2001].
The nucleotides in a SNP position are called alleles. Almost all SNPs have two different alleles which we denote the wild type as 1 and the mutant type as −1. The SNP sequence information on each copy of a pair of chromosomes in a diploid genome is called a haplotype which is a string over {-1, 1}. A genotype is the conflated information of a pair of haplotypes on homologous chromosomes. For a genotype, if a pair of alleles at a SNP site is made of two identical values, this SNP site is called homozygous, otherwise it is called heterozygous.
Haplotypes generally have more information content than that individual SNPs have in disease as sociation studies [Stephens et al. 2001], but it is substantially more diffi cult to determine haplotypes than to determine genotypes or individual SNPs through experiments. Hence, computational methods that can reduce the cost of determining haplotypes become attractive alternatives. There are generally two classes of computational methods for determining haplotype. One class concerns with infering haplotypes from the genotype samples in a population. There are several models based on different assumptions on the biological system under consideration [Clark, 1990;Gusfi eld, 2002;Halperin and Eskin, 2004;Li et al. 2005;Wang and Xu, 2003]. The second class, called single individual haplotyping or haplotype assembly, is based on the data and methodology of shotgun sequence as sembly [Lancia et al. 2001;Lippert et al. 2002]. The input data consists of aligned short genome fragments with SNPs coming from DNA shotgun sequencing or data generated by a resequencing effort for the purpose of large-scale haplotyping. When we focus on SNP positions, these short genome fragments are actually the aligned SNP fragments. Computationally, the individual haplotyping problem is to determine the "best" pair of haplotypes from data (SNP fragments) which is possibly inconsistent and contradictory. It focuses on partitioning SNP fragments into two sets according to SNP states, with each set determining a haplotype [Lancia et al. 2001;Lippert et al. 2002]. For such a problem, there are several models based on different error assumptions [Lancia et al. 2001;Li, L.M. et al. 2004;Lippert et al. 2002;Rizzi et al. 2002;Wang et al. 2005], of which the Minimum Error Correction model (MEC) is widely adopted. MEC assumes that the inconsistence of the data comes from realistic sequence errors and these errors can be corrected. However, MEC is only effective in the case that SNP fragments have a low error rate. When the error rate of SNP fragments is relatively high, we can not reconstruct original haplotypes with a high accuracy by error correction only [Wang et al. 2005]. To improve the haplotyping quality, we need to either reduce the errors in SNP fragments which will call for improvement of the shortgun experiment, or add extra information to the given SNP fragment set. Since genotype data can be much more easily (and also economically) obtained, a computational model combining both SNP fragments and genotype information will be a realistic strategy. The idea is motivated by the method in ]. In their paper they phased long genotypes using local haplotype information. In this paper, we propose a new computational model (Minimum Conflict Individual Haplotyping: MCIH) for individual haplotyping by using this strategy, which is also proved to be NP-hard.
There are two ways to show that a new established model is practical or more effective than an existing model: one is to theoretically prove that the solution of the new model is superior to that of the existing model, the other is to numerically solve the problem and compare the solutions obtained by the two models. It is obvious that we can only do it in the second way. For this reason we try to design an exact algorithm for the MCIH problem. A special data structure suited for a dynamic programming algorithm is displayed and used to evaluate the model. Computational results by this exact algorithm confi rm the significance of MCIH. Moreover, a feed-forward neural network (FNN) is designed for computing approximate solutions to the problem (MCIH) in general case and of large size. Extensive computational results show the effectiveness of the proposed FNN algorithm and MCIH.
The paper is organized as follows: In Section I, we give the problem definition together with the complexity analysis of MCIH. The model evaluation is shown in Section II. In Section III, a feed-forward neural network algorithm is designed for implementing MCIH. Experiment results are given in Section IV. Section V concludes the paper.

I. Formulation and Problem
For a genotype g = (g 1 , g 2 , ···, g n ), if the jth SNP site is wild type homozygous, g j = 2; if it is mutant type homozygous, g j = −2; when it is heterozygous, g j = 0. A pair of haplotypes h 1 = (h 11 , h 12 , ···, h 1n ) and h 2 = (h 21 , h 22 , ···, h 2n ) is called compatible with a genotype g if the following conditions hold: for each SNP site j where g j = −2, h 1j = h 2j = −1; for each SNP site j where g j = 2, h 1j = h 2j = 1; for each SNP site j where g j = 0, Suppose that there are m SNP fragments from a pair of chromosomes and the length of each corresponding haplotype is n. Define an m × n SNP matrix M = (m i j ), whose entry m i j has the value −1,1 or 0 (for a missing or skipped base, we call it a hole). Each row m i corresponds to a SNP fragment f i and each column corresponds to a SNP site. Since the given SNP fragments may have different lengths, but are generally less than n, we also assign value 0 to the uncovered elements in a row. Let x,y ∈ {−1, 1, 0} and define then the distance between two SNP fragments If HD( f i , f k ) > 0, we say two fragments f i and f k are in conflict, otherwise we call them compatible. HD( f i , f k ) > 0 indicates that either f i , f k are not from the same chromosome copy or there are errors in the data. HD( f i , f k ) is similar to the Hamming distance, i.e., the number of mismatches (conflicts) between two fragments. The distance between a fragment and a haplotype is defined in the similar way.
The MEC problem [Lippert et al. 2002;Wang et al. 2005] is defined as: Given a set of SNP fragments, correct a minimum number of SNP states (−1 into 1 and vice versa), such that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments, and each set determines a haplotype. In order to add the information of genotype into MEC, we propose the following combinatorial optimization model for individual haplotyping: MCIH (Minimum Conflict Individual Haplotyping): Given a set of SNP fragments (a SNP matrix M) from an individual's DNA and the related genotype g, reconstruct a pair of haplotypes compatible with g and involving a minimum number of conflicts with the given SNP fragments.
MCIH is in fact to correct a minimum number of SNP states under the guidance of the genotype information so that the modified SNP fragments can be divided into two disjoint sets of pairwise compatible fragments which determine a pair of haplotypes compatible with the genotype. The computational complexity of the MCIH problem (similar spirit with ) is discussed in Appendix A where we prove it to be NP-hard. It indicates that the MCIH problem may have no effi cient algorithm for exact solutions.

II. Model Evaluation
The purpose of presenting the new model is to get a high-quality solution of the haplotype assembly problem. To show that MCIH is a potential alternative, we evaluate the model by studying its exact solutions. A special data structure with Markov property suited for a dynamic programming algorithm is considered as follows.
Let l i and r i be the beginning and ending positions of the ith SNP fragments f i on the SNP matrix respectively, 1 ≤ i ≤ m. For any two fragments f i and f j , we assume if l i ≤ l j , then r i ≤ r j . (2) The rows of the SNP matrix is reordered according to the beginning positions and the ending positions of SNP fragments. For two rows m i and m j , i < j if and only if l i < l j . If l i = l j and r i < r j , the ith row will also be put before the jth row. From the proof of the computational complexity in Appendix A we know that the MCIH with this special structure remains NP-hard.
To solve MCIH with this special structure, we give a dynamic programming (DP) algorithm. The input data of the DP algorithm are a genotype g with length n and a SNP matrix of the type (2). The outputs of the DP algorithm are a partition of the SNP fragments and a pair of halotypes (h 1 , h 2 ) generated from the partition. In fact, the DP algorithm is used to reconstruct one haplotype, say h 1 . Clearly, h 2 can be obtained immediately from g and h 1 .
Suppose that x( j) = ( x l j ,···, j r x ) is an assignment to the positions l j , ···, r j of h 1 , and x r ( j) is the corresponding assignment to h 2 at the same ) denote the number of conflicts between the SNP fragment f j and h 1 with assignment x( j), f 2 ( j, x r ( j)) denote the number of conflicts between the SNP fragment f j and h 2 with assignment )} which implies the haplotype that the jth fragment belongs to. The main steps of the dynamic programming algorithm are as follows: Step 1 Initialization.
Step 2 Follow the recurrence formula. Define N( j, x ( j)) as the minimum associated number of conflicts between the first j fragments and haplotypes (h 1 , h 2 ) at positions 1, 2, … , r j , with the assignment x ( j).
Step 3 Trace the solution to obtain a pair of haplotypes (h 1 , h 2 ) and a partition of the SNP fragments.
When all N( j, x( j)) for 1 ≤ j ≤ m are computed by the recurrence formula, we can find N(m, x(m)) for all x (m) = ( , , ).
x x l r m m g Thus, the solution of the MCIH problem can be ob tained by tracing the solution forwardly which leads to a minimal value at each j from the following formula: ( , ( )).
x min N m m For all j, f ( j, x( j)) = min{ f 1 ( j, x( j)), f 2 ( j, x r ( j))} determines a partition of the SNP fragments.
Assume that L is the maximum length of SNP fragments. It can be shown that the dynamic programming algorithm solving the special MCIH problem has the complexity O(2 2L m). Hence, the algorithm is exponential with the maximum length of SNP fragments. However, when the maximum length of SNP fragments is fixed, this algorithm is linear to m, i.e., the number of fragments. In real applications, the value of L is generally between 3 and 8.
As expected, numerical results (see Appendix B for details) of DP on the special data structure display that MCIH at the cost of genotype information improves the reconstruction rate greatly at various error rates of SNP fragments. An extreme case is considered where every SNP site is heterozygous. It means that the model has no homozygous site information available from the given genotype. The numerical results show that in this case MCIH still has a higher reconstruction rate than MEC. This indicates that MCIH is not trivial, i.e., a higher reconstruction rate does not only depend on homozygous site information. Heterozygous site information has much contribution to the accuracy of the reconstructed haplotypes.

III. A Feed-Forward Neural Network Algorithm
As discussed in Introduction, the MCIH problem is actually a classification problem. That is, given a set of SNP fragments, we want to classify it into two fragment subsets such that each subset determines a haplotype to solve the problem with minimum conflicts. It is well known that feedforward neural network (FNN) is a powerful tool for classification. In general, an FNN maps a set of objects into classes C 1 ,··· ,C s characterized by the attributes of the objects through repeatedly learning the objects and adjusting its parameters (neuron connection weights). In our problem, a set of SNP fragments are to be divided into two classes such that each class can be assembled as one of the haplotypes which will solve the problem with as few conflicts as possible.
The proposed FNN (see Figure 1) consists of three layers. The m input neurons represent m SNP fragments. Each input neuron accepts an n-dimensional vector on {1, −1, 0}. Two hidden neurons represent two subsets of fragments corresponding to one pair of haplotypes. The input dimension of the hidden neuron is also n and the outputs of the two neurons are a pair of tentative haplotypes. There is only one neuron in the last layer, which simply conflates the tentative haplotypes and outputs a tentative genotype. Comparing this tentative genotype with the given genotype will provide us the information to adjust the neuron connection weights by the popular back-propagation algorithm (see Appendix B and related literatures for feed-forward neural networks [Rumelhart et al. 1986;Zhang, 2000]). The main characteristic of the designed neural network is trying to achieve two objectives, i.e., compatibility and minimum number of conflicts simultaneously.

Forwarding process of FNN
The forwarding process of FNN can be stated as follows.
1) The inputs and outputs of the first layer are m rows of the SNP matrix, m 1 , ···, m m , i.e., m SNP fragments f i , i = 1, ···, m. In fact, the neurons in the first layer are trivial identity maps I.
2) There are 2m parameters, i.e., the weights from the first layer to the second layer, 5) The inputs to the third layer are h 1 and h 2 . The neuron function of the third layer is a linear summation, i.e., the output of the third layer is z = (z 1 , ···, z n ), where

1) Learning to satisfy objective-1
The distance between h l and m i (i.e., f i ) is defined as where d is defined by the formula (1) and sgn(x) = , , , .
x x Classify all SNP fragments into two disjoint sets according to their distances to h 1 and h 2 , i.e., For the neuron corresponding to h 1 in the second layer, the network learns to minimize the following error function between h 1 and the SNP fragments in F 1 : For the neuron corresponding to h 2 , the network learns to minimize the error function between h 2 and the SNP fragments in F 2 : 2) Learning to satisfy objective-2 The objective that the third layer adjusts to is to minimize the difference between the tentative genotype z and the original genotype g: A Back-Propagation-like procedure The main steps of the algorithm is as follows:
Step 3-Output Record a pair of haplotypes (ĥ 1 ,ĥ 2 ) and a genotype ĝ as the output.
The formulae to compute the derivatives in the algorithm are given in Appendix B. In contrast to DP algorithm that gives an exact solution, the algorithm based on the neural network can not ensure an exact optimal solution but is proved very effi cient for large-scale problems by the numerical results in next section.

IV. Simulation and Results
In this section, we will use both real data and simulation data to test MCIH and the FNN algorithm for haplotype reconstruction. The computation is implemented on a 2.26G Hz Pentium 4 processor PC using Microsoft Visual C++ compiler 6.
In our experiments, we use reconstruction rate (RR), the similarity degree between the original haplotypes and the reconstructed haplotypes, to measure performance of an algorithm or a model. Assume that h = (h 1 , h 2 ) is the original pair of haplotypes, and ĥ = (ĥ 1 ,ĥ 2 ) is the reconstructed pair of haplotypes. We define RR as: where r i j = HD (h i ,ĥ j ), i = 1, 2, j = 1, 2. We use compatibility rate (CR) to measure the similarity degree between the original genotype g and the reconstructed genotype ĝ: For a partition P = (F 1 , F 2 ) and a pair of haplotypes (h 1 , h 2 ), the corresponding conflict number is defined by (7). In this paper, we set step length ρ = 0.05, ε = 0.01 and λ = 0.1 (in fact, the algorithm is robust with these parameters) in the algorithm. In (15) and (16), we set L 1 = 0.2 and L 2 = 0.8.

Experiment on angiotensin-converting enzyme (ACE)
Angiotensin-converting enzyme catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II, which controls fluidelectrolyte balance and systemic blood pressure. Because it has a key function in the renin-angiotensin system, many association studies have been performed with DCP1 (encode angiotensinconverting enzyme). Literature [Rieder et al. 1999] completed the genomic sequencing of the DCP1 gene from 11 individuals and reported 78 SNP sites in 22 chromosomes. Out of the 78 varying sites, 52 are non-unique polymorphic sites.
Among these 11 individuals, there are two identical genotypes. We omit one of them. In addition, we omit the genotypes with no more than one heterozygous site whose haplotypes can be infered immediately. Now each of the 8 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: the number of SNP fragments m = 20; other parameters, the hole rate of fragments hr: 0.25, 0.5, 0.75; the error rate of fragments e: 0.05, 0.1, 0.15, 0.2 and 0.25.
The results of MCIH and MEC (solved by an exact algorithm in [Wang et al. 2005]) averaged on eight individuals is illustrated in Figure 2. All the results are obtained by running the algorithms only once. When the error rate of SNP fragments is low, the neural network is robust and effi cient. When the error rate of SNP fragment is high, the network may get into a plight of local minima. The genotype compatibility rate returned by the algorithm for most instances is 100% or at least larger than 98%. Only several instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 94% and 96%. In addition, the neural network algorithm solves each of these instances in several seconds. Figure 2 shows that MCIH has a much higher reconstruction rate at various parameter settings, which indicates that with additive genotype information MCIH is effective. Even if an approximate algorithm is employed, it is more effective for haplotype reconstruction than MEC.

Experiment on data from chromosome 5q31
Now we performed simulations using the data from public Daly set [Daly et al. 2001]. They reported a high-resolution analysis of a haplotype structure across 500kb on chromosome 5q31 using 103 SNPs in a European derived population which consists of 129 trios. The haplotypes of 129 children from the trios can be infered from the genotypes of their parents through pedigree information and the nontransmitted chromosomes as an extra 129 (pseudo) haplotype pairs. Markers for which both alleles could not be inferred are marked as missing. Among the resulting set of 258 haplotype pairs, the ones with more than 20% missing alleles are removed, which leaves us 147 haplotype pairs. Among these pairs, 18 genotypes with no more than one heterozygous site are omitted, then 129 pairs of haplotypes are left as the test set.
Each of the 129 pairs of haplotypes is used to generate 15 instances in which SNP fragments are randomly generated according to different parameter settings: m = 30; other parameters, hr: 0.25, 0.5, 0.75; e: 0.05, 0.1, 0.15, 0.2 and 0.25. The results of MCIH and MEC averaged on 129 pairs of haplotypes are illustrated in Figure 3.
The picture again shows that MCIH is quite effective and the designed back-propagation-like algorithm has a very good performance. The genotype compatibility rate returned by the algorithm for most instances is 100% or greater than 98%. Only a few instances with error rate 0.25 and hole rate 0.75 have genotype compatibility rate between 92% and 96%. In addition, the neural network algorithm solves instances with a low error rate and hole rate in several seconds. For a few instances with a high error rate and hole rate, however, it takes the algorithm several minutes to stop.

Experiments on Hudson's simulation data
We use a well-known program ms [Hudson, 2002] that uses coalescent theory to generate a simulated  population of haplotypes. ms has a parameter r as the recombination rate of population haplotypes. Firstly, let r = 0, then 20 haplotypes with 80 SNP sites are generated using ms and form a haplotype set. Then we randomly combine two halotypes in the haplotype set into a pair of individual haplotypes. 15 pairs of haplotypes are obtained by this way. Each of the 15 pairs of haplotypes is used to generate 15 instances according to the parameters as those in the last subsection. All of these instances form a dataset. The results of the two models averaged on 15 pairs of haplotypes are shown in Figure 4. To further evaluate MCIH, let r = 100 and other parameters be the same. The results of two models on this dataset are summarized in Figure 5. The genotype compatibility rate for almost all instances is 100% or larger than 98%.
From Figure 4 and Figure 5 we can see that MCIH is effective on haplotypes not only without recombination but also with a high recombination rate.

V. Conclusion
Individual haplotyping is an important problem in computational biology. In this paper, we proposed a new computational model for individual haplotyping-MCIH as an improvement of MEC and as an alternative way for biologists to solve the haplotyping problem more efficiently. The model is proved to be NP-hard.
To evaluate the new model, we displayed a special SNP matrix structure for which a dynamic programming algorithm can be used to solve MCIH exactly. Comparing the exact solutions of MCIH and MEC, we argue that the proposed MCIH is worth further studying. Due to the computational intractability of the MCIH problem, a feed-forward neural network is designed and a back-propagation-like procedure is formed as an efficient approximate algorithm. Computational results on multiple data sets show that the designed algorithm performs well and MCIH at the cost of additive genotype information has a higher accuracy of haplotype reconstruction than MEC, especially in the case of SNP fragments with a high error rate. Since genotype information can be obtained much easily and economically, the new model is practical as a supporting tool for reconstruction of haplotypes.

B. Evaluation MCIH on special SNP matrix-DP algorithm
In this part, we use simulation data to evaluate MCIH with an exact algorithm-dynamic pro gramming algorithm. 100 pairs of haplotypes with 25 SNP sites are randomly generated according to a parameter similarity rate s (the similarity degree between two haplotypes in one pair). Then, each of 100 pairs of haplotypes is used to generate 5 instances in which SNP fragments are randomly generated according to error rate e: 0.05, 0.1, 0.15, 0.2 and 0.25. The lengths of SNP fragments are between 6 and 8. The number of SNP fragments in each instance is between 20 and 40. We first let s = 0.5. The results of MCIH and MEC (solved by an exact algorithm in [Wang et al. 2005]) averaged on 100 pairs of haplotypes are summarized in Table 1. The dynamic programming algorithm solves each instance in no more than one second.
As expected, MCIH, which employs genotype information, raises the reconstruction rate greatly at various error rates of SNP fragments. In order to evaluate MCIH more objectively, we let s = 0, i.e., every SNP site in the genotypes corresponding to 100 pairs of haplotypes is a heterozygous site (this is an extreme case). Table 1 shows that in this case MCIH has a higher reconstruction rate than MEC even without employing homozygous site information. This indicates that MCIH is not trivial, i.e., a higher reconstruction rate does not only depend on homozygous site information. Heterozygous site information also has much contribution to the accuracy of the reconstructed haplotypes.